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Abstract: This paper discusses a class of thresholding-based iterative se- 
lection procedures (TISP) for model selection and shrinkage. People have 
long before noticed the weakness of the convex Zi-constraint (or the soft- 
thresholding) in wavelets and have designed many different forms of noncon- 
vex penalties to increase model sparsity and accuracy. But for a nonorthog- 
onal regression matrix, there is great difficulty in both investigating the 
performance in theory and solving the problem in computation. TISP pro- 
vides a simple and efficient way to tackle this so that we successfully borrow 
the rich results in the orthogonal design to solve the nonconvex penal- 
ized regression for a general design matrix. Our starting point is, however, 
thresholding rules rather than penalty functions. Indeed, there is a univer- 
sal connection between them. But a drawback of the latter is its non-unique 
form, and our approach greatly facilitates the computation and the anal- 
ysis. In fact, we arc able to build the convergence theorem and explore 
theoretical properties of the selection and estimation via TISP nonasymp- 
totically. More importantly, a novel Hybrid-TISP is proposed based on 
hard-thresholding and ridge-thresholding. It provides a fusion between the 
io-penalty and the i2-pi2nalty, and adaptively achieves the right balance be- 
tween shrinkage and selection in statistical modeling. In practice, Hybrid- 
TISP shows superior performance in test-error and is parsimonious. 
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1. Introduction 

Lasso [30] has attracted people's a lot of attention recently because it provides 
an efficient and continuous way for variable selection, thereby achieving a stable 
sparse solution. Although in the orthonormal case it is well understood and has 
elegant theories [3, 7, 12], its shrinking and thresholding are not direct for a 
general regression matrix, and it suffers some problems in both selection and 
estimation [8, 37, 39]. There has been a large and rapidly growing body of 
literature for the lasso studies over the past few years. The efficient procedures 
proposed for solving the lasso include the well known LARS (Efron et al. [13]), 
the homotopy method (Osborne et al. [26]), and a recently rc-discovcrcd iterative 
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algorithm (Fu [17] Daubcchics et ai, Friedman et al. [16], Wu & Lange [33]). 
As for the theoretical aspects of the lasso, we refer to Knight & Fu [23], Zhao 
& Yu [37], Donoho et al. [11], Bunea et al. [5], Zhang & Huang [35], etc. for 
asymptotic and nonasymptotic results. Various extensions and modifications to 
lasso have also been proposed, such as the grouped lasso (Yuan & Lin [34]), the 
Dantzig selector (Candes and Tao [9]), the adaptive lasso (Zou [38]), and the 
relaxed lasso (Mcinshausen & Yu [25]). 

This paper aims to improve the naive Zi-penalty, by using nonconvex penal- 
ties, to achieve an effective and efficient procedure for model selection and 
shrinkage. The rest of the paper is organized as follows. Section 2 provides 
a mechanism to borrow the rich nonconvex penalties in the orthogonal design 
to solve the general problem. From the point of view of thresholding rules. Sec- 
tion 3 constructs the thresholding-based iterative selection procedures (TISP) 
for model selection and successfully builds the convergence theorem. Section 
4 investigates the theoretical properties of the selection and the estimation 
via TISPs nonasymptotically. In Section 5, we carry out an empirical study 
of TISP design which leads us to a novel Hybrid- TISP proposed based on hard- 
thresholding and ridge-thresholding. It provides a fusion between the ^o-penalty 
and the Z2-pcnalty, and adaptivcly achieves the right balance between shrinkage 
and selection in statistical modeling. In practice. Hybrid- TISP shows superior 
performance in both test-error and sparsity. Section 6 gives a real data example. 
All technical details are left to the Appendices. 

2. Motivation Prom Orthogonal Designs to Non-orthogonal 
Designs 

We consider the penalized regression problem 

mm i||X/3 -y\\l + Pif3; A)(^ /(/3)), (2.1) 

where X = [xi,X2, - ■ ■ ,Xp] is the regression matrix, y G i?" is the response 
vector, and P{f3; A) represents the penalty with A as the regularization parame- 
ter. Here p may be greater than n. In this paper, we assume /3 is sparse, and use 
(2.1) for predictive learning. Although predictor error or accuracy is our first 
concern, we prefer to obtain a parsimonious model that is more interpretative 
in practice and is consistent with Occam's razor. Usually P is assumed to be 
an additive penalty in the sense that P{f3; A) is obtained by a univariate P: 
P{f3; X) = J2 X)- This sparsity problem has wide apphcations in variable 
selection, functional data analysis, graphical modeling, compressed sensing, and 
so on. 

If P(/3; A) = A||/3||i, then (2.1) is the lasso [30], a basic and popular method 
in variable selection. However, although the ^i-norm provides the best convex 
approximation to the Zp-norm and is computationally efficient, the lasso cannot 
handle coUinearity [39] and may result in inconsistent selection (cf. the irrepre- 
sentable conditions [37]) and introduce extra bias in estimation [25]. 



Y. She/TISP for Model Selection and Shrinkage 3 

On the other hand, if we concentrate on orthogonal designs only, i.e., X = 
/, like in wavelets, li is far from the only choice. There are established theories 
and algorithms for various types of (nonconvex) penalties. 

Example 1. Hard-penalties. 

'-e^/2 + x\0\, if \e\ < A 

^AV2, if |6i| > A 
2. P{9; A) = A^/2 • le^o, which is in fact the ^o-penalty. 

'a|6i|, if \e\ < A 

AV2, if |6i| > A 



1. P{e; A) = <J ^ ' .7,,! 1\ , due to Antoniadis [1]. 



3. Pi9; X) = { [ !; .J 7, ^ . , due to Fan [14]. 



It is interesting to note that all three lead to the same estimator obtained by 
hard-thresholding. 

f A, if 61 < A 

Example 2. SCAD-penalty. P'{9; A) = < (aA - e)/{a - 1), if A < 61 < aA 

[o, if 61 > aA 

for 61 > and a > 2. The default choice of a is 3.7, based on a Bayesian argument 
(Fan [3]). 

Example 3. Transformed Zi-penalty. P{6;\) ~ A6|6'|/(l + 6|6'|) for some 
6 > 0, due to Gcman & Reynolds [20]. 

In this simplified setup, (a) the fitting part of the penalized regression (2.1) 
is separable in this case, which means we only need to deal with the univariate 
case, if P is also separable (which is true in general); (b) even if P is nonconvex, 
it still often results in a unique solution. 

One of our main goals in this paper is to borrow these rich results in the or- 
thogonal design to help us solve the general problem (2.1). We use the following 
mechanism to achieve this. Define 

g(/3, 7) = i||X7 -y\\l + P{r, A) + i < (J - S)(7 - /3),7 - /3 > . (2.2) 

Here <a,b>= a^b, S = X'^X. 

Given /3, minimizing g over 7 is equivalent to 



arg mm — 
7 2 



^ \l~'E)(3 + X^y] ' + P(7;A). (2.3) 



7 



In contrast to (2.1), this problem has an orthogonal design — as mentioned ear- 
lier this is easier to handle both in computation and in theory. For example, we 
may adopt some nonconvex penalties, and they still result in a unique solution 
of 7. 

Given 7, minimizing g over (3 is equivalent to 

argmini < (J-S)/3,/3-27> . (2.4) 
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Taking its derivative with respect to (3 gives (J — E)(/3 — 7) = 0. from which 
it follows that /3 = 7 if ||S||2 < 1- Note that (2.4) is a convex optimization. 
Therefore, the optimal value of g is always achieved at 7 = /3 if X is scaled 
down properly. 

The connection to the original problem is now clear: it is easy to verify 
min/3 g{f3, (3) is equivalent to min/3 f{fi). The advantage of optimizing g instead 
of / is that given (3, the problem is orthogonal and separable in 7, and we 
can adopt far more flexible penalties in the algorithm design, including the 
nonconvcx ones. 

3. Thresholding-based Iterative Selection Procedures (TISP) 
3.1. Thresholding Rules and Penalties 

As the title suggests, our starting point in this paper is thresholding rules rather 
than different forms of the penalty function. One direct reason is that differ- 
ent P's may result in the same estimator and the same thresholding, say, in 
the situation of hard-thresholding [1, 14]. Moreover, starting with thresholding 
functions facilitates the computation (as will be shown in the next subsection). 
Besides, there is also a universal connection between thresholding rules and 
penalty functions that we will investigate in this subsection. For convenience, 
we consider the univariate case only. 

A thresholding function, denoted by Q{-; A), with A as a parameter, is required 
to satisfy: 

1. 0(-; A) is an odd function. (0+(-; A) is used to denote the 8(-; A) restricted 
to R+ = [0,00).) 

2. 6 is a shrinkage rule: < e+(t; A) <t,yt e R+. 

3. 0+ is nondecreasing on 7?+, and 0+(t; A) ^ 00 as t ^ 00. 

In addition, it is natural to have 0+(i; A) = 0, < t < r for some r > 0. 

Given a thresholding rule 0(-;A), a penalty function can be obtained from 
the following three-step construction. First, define 

e-^iu; A) = sup{i : e(t; A) < u} and Q-^-u; A) = -9"^"; 
for any u G Then define 

s{u; X) ^ Q-\u; X) - u,yu. (3.1) 
Finally, let P be a continuous and positive penalty defined by 

r\e\ 

P(6l; A) = / s{u;X)du. (3.2) 

Antoniadis [2] showed the following result for this constructed P. 

Proposition 3.1. The minimization problem imiig{t — 6)'^/2 + P{0;X) has a 
unique optimal solution 9 = Q(t; A) for every t at which 0(-; A) is continuous. 
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In addition, if wc define Tjj{t) ~ t^Q[t), then it is the psi- function for defining 
M-estimators; see [2, 18]. 

Note that (3.2) is not the only way to construct a penalty that leads to 8 in 
solving the optimization. For example, in the situation of hard-thresholding, in 
addition to the continuous penalty 

P = Xy2-{\9\-Xfl^g^^j2 (3.3) 

constructed via (3.2), 



P{9; A) = <! ^ , and P(0; A) = - • l,^o (3.4) 

^ ' ^ AV2, if |0| > A ' ^ ' ' 2 ' 



are also valid choices [1, 14]. In some sense, (3.3) may be considered as a con- 
tinuous version of the discrete ^o-pcnalty. 



3.2. TISP and Its Convergence 



Now we go back to the mechanism introduced in Section 2 for the penalized 
multivariate regression problem (2.1), with P constructed from a given thresh- 
olding hmction e. Solving (2.3) yields 7 = e((/ - S)/3 + X^y; A). Seen from 
(2.4), our iterates simplify to 

^(j+i) = e((j _ ^-^fjU) + x^y; A). (3.5) 

This iterative procedure is referred to as the Thresholding-based Iterative 
Selection Procedure (TISP). TISP provides a feasible way to tackle the 
original optimization (2.1). It is a simple procedure that does not involve any 
complicated operations like matrix inversion. 

There are rich examples for the procedure defined by (3.5). (a) Using a soft- 
thresholding in (3.5), wc immediately obtain the iterative algorithm (in vector 
form) for solving the lasso problem where P(/3; A) = Aj|/3||i [10]. In fact, the 
asynchronous updating of (3.5) leads exactly to the component-by-component 
iteration referred to as the coordinate decent algorithm (see Friedman et al. [16]). 
The corresponding pathwise algorithm has been considered to be the fastest in 
solving the lasso problem to date, especially when p > n. (b) If we substitute 
hard-thresholding for 0, seen from (3.4), it is an alterative optimization for 
solving the penalized regression with 

P = c.^l^,^o = c-||/3||o, 

i 

i.e., the /o-pcnalized regression problem, (c) We can also replace the hard- 
thresholding by the more smoothed SCAD to reduce instability, (d) Finally, 
it is worth mentioning that TISP may also include the ridge penalty P{f3; A) = 
A|l/3||i/2, if we set 



e(t;A)^^, 



(3.6) 
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thanks to the generic definition of a thresholding function. 

Obviously, if S is nonsingular, and so ti > p, the TISP mapping is a con- 
traction and thus the sequence /j'-*^ converges to a stationary point of (2.1). We 
would like to apply TISP to large p problems as well where S is singular — 
a surprising fact is, however, that TISP may not be a nonexpansive operator^ 
for most thresholdings (except soft-thresholding), let alone a contraction. The 
following studies cover the large p case {p > n). We use n{A) to represent an 
arbitrary singular value of matrix A, and /ii„ax(^) (Mmin(^)) the max (min) of 
/z(A), respectively. 

Without loss of generality, suppose the penalty function defined by (3.2) 
satisfies the bounded curvature condition (BCC) for some symmetric matrix 
H: 

P{(3 + A; A) > P{(3- A)+ < A, s > -iA^i?A,VA e W (3.7) 

where s = s(/3; A) is given by (3.1). Many thresholding rules of practical interest 
including Example 1-3 satisfy the BCC with a positive semi-definite H. For in- 
stance, for soft-thresholding, if = since ||/3||i is convex; for hard-thresholding, 
H = I; for SCAD-thresholding, we can take H = //(a — 1) (recall that the 
parameter a is assumed to be greater than 2 in Example 2, and so H is positive 
definite). 

Theorem 3.1. Given the TISP (3.5), i/^max(S) < IV (2 - //,nax(-H')), then 

/(/3^^'^)>/(/3'^+'^). (3.8) 

Moreover, if /Xmax(S) < 1 V (2 — /i,„ax(-ff)), there exists a constant C > 0, 
dependent on X , H only, such that 

/(/3(-'")) -/(;3(-''+i)) >C-\\/3^^^ -f3^J+'^\\l. (3.9) 

Therefore, for an arbitrary X, we can use TISP of the following form in 
practice 

where ko = ^max(^) = H^lh, although larger values of ko generally lead to 
faster convergence. Applying Theorem 3.1 to some interesting special cases gives 
the following corollaries. 

Corollary 3.1. Suppose 8 is soft-thresholding. If ^nia^iX) < \f2, then (3.9) 
holds. 

Corollary 3.2. Suppose Q is hard-thresholding. If /iniax(-X^) < 1, then (3.8) 
holds; further, if UmaxiX) < 1, then (3.9) is true. 

^An operator T is called nonexpansive [4] if ||T(2:) — T(j/)|| < — y || for any x, y. Obviously, 
the hard-thresholding function is not nonexpansive. 
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Corollary 3.3. Suppose Q is SCAD-thresholding. // /^niax(^) < y^ — 
then (3.9) holds. 

Corollary 3.1 generalizes the lasso result by Daubechies et al. [10], and co- 
incides with our previous study [27]. Corollary 3.3 covers the orthogonal case, 

since SCAD assumes a > 2 and thus — > 1. Finally, it is worth point- 
ing out that TISP may not always be an MM algorithm [21] like the LLA 
method by Zou & Li [40]. Take the SCAD-thresholding as an example: when 

1 < ||X||2 < ^2 - g defined by (2.2) does not majorize / but TISP con- 
verges. Theorem 3.1 implies that if X is scaled down properly (which does not 
affect the variable selection), fifi'"''^) is nonincreasing all the time during the 
iteration process. 

We can easily show a result similar to Zou & Li [40]: 

Proposition 3.2. Suppose /^max(S) < 1 V (2 — /Xmax(-ff)). Give an initial point 
/3(0), i//3* is a limit point of the TISP sequence I3^^\ then 13* is a stationary 
point of f{f3) (2.1), or equivalently, a fixed point of (3.5). 

Denote by F the set of the fixed points of TISP. That is, given any /3* e F, 
it satisfies the implicit equation 

/3-e((/-S)/3 + X^y;A), (3.11) 

referred to as the Q- equation. Clearly, local minima of / are fixed points of 
(3.11). In the next section, we will perform an nonasymptotic study of the good 
properties of the points in F. Here, we give the following optimality result. 

Proposition 3.3. Let f3* G F and suppose /ii„ax(-ff) < 1- If Mmax(-ff) < 
^(S) < 2 — jii-a!i.y_{H) , then (3* is a global minimizer of f . 

Although the fact that nonconvex penalties often result in a unique optimal 
solution in the orthogonal design is well known, this proposition states (novcUy) 
that the same conclusion holds as long as X is not too far from orthogonal 
(characterized in terms of H). For instance, for SCAD thresholding and penalty, 
TISP necessarily leads to the global minimum of /, provided -y=f < fJ-{X) < 

^2 - or 0.61 < ^i{X) < 1.27 when a = 3.7 (the default choice in SCAD- 

see Example 2), given any initial point /S'"-*. In summary, TISP is a successful 
algorithm for solving the penalized regressions for a general design matrix. 

3.3. Related Work 

The main contribution of this paper is to consider a new class of 0- estimators 
defined by the 0-equation (3.11) for model selection and shrinkage, which can be 
naturally computed by TISP, and are associated with penalized regressions — 
in particular, the penalty P can be constructed via the three-step procedure 
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introduced in Section 3.1. More generally, the A in (3.11) can be component- 
specific. For example, if X is not column-normalized, we may use 

/3 = e((/-£)/3 + X^y;A), (3.12) 

where A = [ A|ja;i||2 A||a;2||2 ■■■ A||a;p||2 and A is a regularization pa- 
rameter. With a carefully designed Q, we obtain a good estimator with both ac- 
curacy and sparsity, as will be shown in Section 5.2. The corresponding penalty 
is, not surprisingly, nonconvex, which indicates the difficulty of this NP-hard 
problem. 

Nonconvex penalties have been successfully used in real-world applications 
like high-dimensional nonparametric modeling [3] , survival analysis [6] , and mi- 
croarray data analysis [32, 36], where they achieve outstanding performance. 
The numerical optimization has been a challenging and intriguing problem. In 
the context of wavelet denoising where XX^ = /, Antoniadis & Fan [3] pro- 
posed the ROSE to approximately solve the minimization problem for a wide 
class of nonconvex penalties. They also introduced the graduated nonconvexity 
(GNC) algorithm, developed in image processing; it has a number of tuning 
parameters and is computationally intensive. Fan and Li [15] then proposed a 
generic local quadratic approximation (LQA) algorithm by solving a series of 
^2-penalized problems. Like ridge regression, this approach does not intrinsically 
yield zeros, and setting a small cutoff value during iteration has been shown to 
be too greedy. A refined version is the perturbed LQA suggested by Hunter & 
Li [22] to avoid numerical instability. The perturbation parameter needs to be 
chosen very carefully in implementation since it affects the sparsity of the solu- 
tion as well as the speed of convergence. Recently, Zou & Li [40] proposed a new 
local linear approximation (LLA) which significantly improves the LQA. Explicit 
sparsity is attained by solving a weighted lasso problem at each iteration step. 
(Note that our TISP does a simple thresholding at each step.) One-step SCAD 
estimator is advocated. Our empirical studies show that this one-step convex 
approximation has limited power in finite samples. Although the estimate is 
sparser than using the plain Zi-penalty, it may result in misleading models with 
poor prediction error. See Section 5 for detail. 

Using thresholding rules to define 0-estimators shares similarities to the stud- 
ies of M-estimators of ip-type in robust regression. Most M-estimators were 
proposed in the form of T/j-functions but not based on loss functions, such as 
Huber's, Hampel's three-part, and Tukey's bisquare M-estimators. Indeed, we 
find an interesting connection between these two fields. Assume a mean shift 
outlier model, y = X(3 -I- 7 -|- e, e ~ N{Q, a'^I), where n > p and 7 € i?"" is 
sparse. If 7, is nonzero, case i is an outlier. Let H = X(X'^ X)~^X^ be the 
hat matrix and suppose its spectral decomposition is given hy H ~ UDU^ . 
Define an index set c — {i : Da = 0} and Uc is formed by taking the corre- 
sponding columns of U. Then a reduced model can be obtained from the mean 
shift outlier model 



y = Aj + e', e' ~ A^(0, cr^/(„_p)x(„_p)), 



(3.13) 
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where y = U'^y, A ~ C/J (of dimension (n — p) x n). Applying 0-TISP to 
this 'large-p' sparsity problem gives the iteration 'y^^+i) = Q{A^y/kQ + (J — 

A^A/fc2)^0);A/fc2) with fco = tl,nax{H), Or 

^ e(if^O) + (/ - H)y, A). 

After getting 7 ft-om TISP, we can estimate /3 by /3 = (X^X)-iX^(y - 7). 
Simple algebra shows that this special 0-TISP solves an M-estimation prob- 
lem associated with V', if (©, V') satisfies A) -I- ipit; A) = t. It is well known 
that Huber's method (or equivalently, Soft-TISP, which corresponds to using 
a convex /i-penalty on 7) behaves poorly in outlier detection even for mod- 
erate leverage points. Instead, redescending ■(/'-functions are advocated, which 
corresponds to using nonconvex penalties for the sparsity problem of (3.13). 

4. Selection and Estimation via TISP 

TISP provides a very simple way to do variable selection via penalized regres- 
sions. In this section, we will perform a theoretical study of the variable selection 
and coefficient estimation by TISPs based on different thresholdings. Our results 
are nonasymptotic. 

4.1. Assumptions on 

Given 8(-; A), denote its thresholding value by r(A), i.e., 8(t; A) = V< : |t| < r 
and 8(t; A) 7^ for \t\ > r. For example, r(A) = A in soft-, hard-, and SCAD- 
thresholdings, but is not so for the transformed li. Assume r > 0. To ease our 
TISP study based on the 0-equation (3.11), we define another version of s, 
called the generalized sign. Introduce 

Sgn(M; A) = {s G i? : 6(m + rs; \) = u} if u G ran(9), 

and Sgn(it; A) = {0} otherwise, where ran(9) is the range of 6; sgn(M; A) is used 
to denote a specific element in Sgn(w;A). The vector versions of Sgn and sgri 
can be defined correspondingly. Clearly if u = Q{t; A) then t ^ u + TSgn{u; A) 
for some sgfL(M; A) G Sgn(u; A). 

As a demonstration, if 9(-;A) is soft-thresholding, r = A and Sgn(/3) = 
{s : s, = 1 if Pi > 0, Si = -1 if f3i < 0, and Si G [-1, 1] if Pi = 0}. Thus now 
Sgn(/3) is the subdifferential of and sgii(/3) is a subgradient [29]. For hard- 

thresholding, Sgn(/3) = {s : Si = if Pi 7^ 0, G [-1, 1] if Pi = 0}. Sgn and sgii 
are called generalized signs due to the following fact. 

Proposition 4.1. Suppose Q{-; A) is sandwiched by soft- and hard-thresholdings, 
&si-;T) and e//(-;T), i.e., 

iQs)+{t;r) < e+(i;A) < (eff)+(t;T),Vi G R+. (4.1) 

Then < sgn{u) < I if u > 0, —1 < sgn{u) < ifu<0, and sgn(0) G [—1, 1]. 
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This proposition is easy to prove from the non-decreasing property of O. 
Throughout the rest of the section, we assume 8 always satisfies the sandwiching 
condition (4.1). By the definition of the generahzed signs, (3.11) is equivalent 
to S/3 = X^y — rsgn(/3;A), for some sgn(/3; A) € Sgn(/3;A). We study the 
TISP estimate based on the scaled form (3.10). Let ^9 be a fixed point of (3.10) 
and suppose r(A) = ct{X/c) for any c > 0. Then the 0-equation for this TISP 
estimate can be rewritten as 

I]^ = X^y-rSgh(/3;A/fc2), (4.2) 

where fco = j|^l|2- 

4-2. Sparsity Recovery 

Recall that y = X(3 + e, e ~ A^(0, cr^I), and (3 is sparse. Let z = {i : Pi = 0}, 
nz = {i : (5i ^ 0}, dz — \z\, dnz = \nz\. To study the sign-consistency of a TISP 
estimate, we denote by ps the probability of successful sign recovery, that is, the 
probability that there exists a /3 G F such that sgn(/3) = sgn(/3). 

To simplify asymptotic discussions, we assume X has been scaled to have 
all column Z2-norms equal to \/n. Define S'*-* = S/n. To get a better form 
of the bounds for ps, we define two quantities /i = /iniin(S^^j „^) and k = 
maxjlS^"^^ II 2/ Vc^nz- Intuitively, k measures the 'mean' correlations between the 

relevant predictors and the irrelevant predictors. The following nonasymptotic 
result is always true regarding the selection via TISP. 

Theorem 4.1. Assume ^ > ndnz, /i > and min|/3„^| > then 

Ps > [1 - 2$(-M)]'^= [1 - 2$(-L)]'^"^ , (4.3) 

where M = (l- L = ^ ( min |/3„ J - ^^^V and $ is the stan- 

dard normal distribution. 

Corollary 4.1. Under the conditions of Theorem 4-1, we have 

l-Ps< 2dzip{M) /M + 2dnz'p{L) /L, (4.4) 

where ip is the standard normal density. 

Clearly, the size of n is very important. A small value of k weakens the 
interference of X^ and X„z and helps recover the sparsity correctly. We can 
also use this theorem to explore some asymptotics. (i) Assume /3, d^, and dnz 
are fixed, n —f 00, then under some regularity conditions we get: if t j ypn 00 
and r/n 0, then TISP is sign consistent. This result in the Soft-TISP (lasso) 
case coincides with other studies like [23, 37]. (ii) Suppose /3„j, and dnz are fixed, 
n,dz — > 00, and fJ. > {1 + e)Kdnz for some e > 0. Then TISP can successfully 
recover the sparsity pattern of /3 if dz(p{M)/M — > and r/n 0, which only 
requires n to grow faster than logdz. 
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Unfortunately, the regularity condition /i > Kdm cannot be removed in gen- 
eral. In the lasso case, it is a version of the irrepresentable conditions [37]. (We 
took this more restrictive form because it is more intuitive and leads to more 
nice-looking bounds in (4.3) and (4.4).) However, for hard-thresholding-like 0's, 
this is unnecessary and we can obtain stronger results. 

We say that Q belongs to the hard-thresholding family if 

Q{t;X)^t,yt:\t\>c-T, (4.5) 

for some constant c > 1. Hard-thresholding and SCAD-thresholding are two 
examples with c = 1, a respectively. Unlike soft-thresholding, they do not intro- 
duce bias for large nonzero components. 

Theorem 4.2. Suppose Q belongs to the hard-thresholding family andmiii \(3„^\ > 
cr/fep. Then 

Ps>[l- 2^{-AI')f' [1 - 2$(-i')]''"^ (4.6) 



where M' — 




Corollary 4.2. Under the conditions of Theorem 4-8, we have 

1-Ps< 2dMM')/M' + 2dnML')/L'. (4.7) 

(4.6) is strictly better than the bound in (4.3) if c < dnzk"^ / [l^n) , or c < 
dnzA'max(S'-*-')//^min(S^*2)5 wliicli is usually truc for both hard- and scad-thresholding. 
Therefore the TISP induced by a 8 in the hard-thresholding family can achieve 
better performance in variable selection.^ This will be verified empirically in 
the next section. Note that although in the orthogonal case, hard-thresholding 
and soft-thresholding give exactly the same zeros, they result in very different 
sparsity patterns in our iterative procedure for a nonorthogonal X. 



4-3. Estimation Risk 



We obtain the following TISP risk bounds for any thresholding O. 

Theorem 4.3. Let v = ^mi„(S(';^) and /3 e F. Define R^^ = £'(||/3,„ 
and Rz — E{\\f3^\\2). Suppose S is nonsingular. Then 



Rnz < — 

n 



dn 



V2 



d t"^ d d 

h K 5— • nRz 



(4.8) 



And 



Rz < liK^M + K2 — MM), 



(4.9) 



-Note that, however, the regularization parameters are generally tuned to reduce the test 
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1+ 



where M is defined as in Theorem 4.1, Ki= 6 / """/"/^^'^i , = 6 1 - k'^^^^^ 

in which we assume < and ^ > ndnz ■ 

This general result holds for any TISP estimate. It is not difficult to show 
that in the previous setting of (i) where (3 is fixed, we also obtain i?2 — > 
and Rnz — > by the theorem if tJ %/n co and r/n ^ oo. Besides, under the 

conditions stated in the theorem. M — \J 2 log ^ + (1 + e) log log ^ is sufficient 



to ensure i?^ for any e > 0; since M ^ \J2 log < ^/2logc?7, Donoho & 
Johnstone's classical work [12] in the orthogonal design implies this risk bound 
can not be improved significantly in general. We leave the TISP design problem 
to the next section using an empirical study. 

In the orthogonal case, we can show the oracle inequalities [12] hold. 

Theorem 4.4. Suppose Q satisfies the sandwiching condition (4.1) and X = 
/. Then 

E0 -f3\\l<il+ r') jZ min (^-^ + Pi o^) (4.10) 



1 



for any r > 1. Consequently, when t = logn, 

i?||^^/3||2< (21ogn+l) { Vmin(/?f,cT^)') (4.11) 

V VTi" log' n ^ J 

for any n > 2. 

This nonasymptotic result covers soft-, hard-, and SCAD-thresholdings. It 
coincides with the classical soft-thresholding studies [12] and is sharper than [3, 
38] . (A correction of Zou's oracle bound [38] is also given at the end of the proof; 
see Appendix A. 4.) 



5. TISP Designs: An Empirical Study 
5.1. A numerical study of TISPs 

In this section, we demonstrate the empirical performance of TISPs by some 
simulation data. Although there are rich choices about O in (3.5), we focus on 
three basic TISPs only in this subsection. In addition to the Soft-TISP, i.e., the 
lasso, we implemented Hard-TISP and SCAD-TISP, the thresholdings of which 
belong to the hard-thresholding family. The parameter a in SCAD-thrcsholding 
takes the default value, 3.7, based on a Bayesian argument [15]. As seen from 
the theoretical studies in Section 4, the last two should perform better than 
the lasso in variable selection. In generating the solution path for a grid of 
A- values, we always set the initial point, /S'^-*, to be zero in Hard- or SCAD- 
TISP. A natural search range for A, seen from (3.11), is [0, X'^y], if X has been 
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column normalized. (Note that a pathwise algorithm with warm start, which 
takes the previous estimate associated with the old value of A as the initial 
point of the procedure for the current value of A, may be inappropriate for 
TISPs when nonconvex penalties are used. In fact, the solution path associated 
with a nonconvex penalty is generally not continuous in A and warm-start leads 
to bad solutions because of multiple local minima effects.) 

For comparison, the one-step LLA method, proposed by Zou & Li [40] for 
penalized likelihood models, is also included in our tests. They showed good 
asymptotics about one-step SCAD when n — > oo and p is fixed, and demon- 
strated its performance in various numerical examples. The one-step LLA is 
actually a weighted lasso with weights constructed from the OLS estimate using 
different penalty functions. According to our general result of weights in sparse 
regression [27] , it can achieve better sign consistency than the lasso as n grows 
to infinity. We are greatly interested in drawing a comparison between TISP 
and LLA since TISP also successfully solves the penalized regression problems. 

We did experiments on two simulation datasets. Each dataset contains train- 
ing data, vahdation data, and test data. We use # ="•/ • /•" to denote the 
number of observations in the training data, validation data, and test data. Let 
S be the correlation matrix in generating X, i.e., each row of X is indepen- 
dently drawn from A^(0, S). We use ({ai}"i , • • • , {afc}"'") to denote the column 
vector made by ni oi's, • • • , n^. a^-'s consecutively in the following examples. 
Example 1. # = 20/100/200, d = 8, /3 = ({3}i, {1.5}i, {O}^, {2}\ {O}^), 
Ejj ^ pl*^^! with p = 0.5, a = 2,3, 5, 8; the corresponding signal-to-noise vari- 
ance ratio {(3^1^(3 /a'^) is 5.31, 2.36, 0.85, and 0.33, respectively. 
Example 2. # = 20/100/200, d = 8, /3 = ({3}i, {1.5}i, {O}^, {2}\ {O}^), 
Ey = with p = 0.85, (7 = 2, 3, 5, 8; the corresponding signal-to-noise 

variance ratio is 8.21, 3.65, 1.31, and 0.51, respectively. 

Before an algorithm is applied, the columns of a regression matrix are all nor- 
malized to have a squared /2-norm equal to the number of the observations; no 
centering is performed in these examples. 

Each model is simulated 50 times, then, we measure the performance of each 
algorithm mainly by test error and sparsity error. The test error is characterized 
by the 40% trimmed-mean of the scaled MSE (SMSE) on the test data, where 
SMSE is 100 • (Ej=i(y« - y»)V(^cr2) _ defined for the test data. (Medians of 
MSEs are mostly used [30, 39] to measure the performance from multiple runs, 
but are not so stable for comparisons based on our experience.) The sparsity 
error here is defined by the 40% trimmcd-mcan of the following 50 percentages: 
100 • |{i : sgn(/3i) ^ sgn(/3i)}|/(i, which represents the number of inconsistent 
signs for each estimate compared to the true (3. We also summarized the proper 
zero percentages, 100% • |{i : f3i = 0,$i ~ 0}|/|{« : f3i = 0}|, and the proper 
nonzero percentages, 100% • \{i : (3i ^ 0, Pi ^ 0}|/|{i : Pi 7^ 0}| in the table 
as follows. The numbers in parentheses are the standard errors of the trimmed 
means of SMSE, estimated by bootstrapping the SMSE 500 times as in [38]. 
The total computing time (in seconds) for each algorithm is also included. 
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Total computing time 


284 


599 
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2234 



Table 1 

Performance comparisons on the simulation data, in terms of test error, sparsity 
error, proper sparsity, and proper nonsparsity - all the numbers are 40% 
trimmed-mean of the 50 simulations. Six methods are listed here: lasso (Soft-TISP), 
one-step SCAD, Hard-TISP, SCAD-TISP, elastic net (cNet), and Hybrid-TISP; the 
last two both have two regularization parameters. The last row gives the total 
computing time in seconds. 
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First, although Zou & Li's onc-stcp SCAD brings more sparsity than the lasso 
estimate (seen from the proper-sparsity and proper- nonsparsity), it is often the 
worst in terms of test error. This is because the one-step SCAD is indeed a 
weighted lasso method and the OLS estimate used for weight construction may 
not be trustworthy, if, say, there is large noise, or high correlation between 
some variables. This phenomenon is serious in Example 2 where the OLS esti- 
mate can be unstable and misleading. Our Hard-TISP and SCAD-TISP clearly 
showed the remarkable parsimoniousness brought by nonconvex penalties. In- 
stead of solving a Zi-constrained convex approximation as in the LLA method, 
our TISPs directly tackled the original nonconvex penalized regressions and 
demonstrated better performance in both test-error and sparsity-error. (In fact, 
we doubt if the Zi-based one-step SCAD is truly able to solve the SCAD pe- 
nalized regression, seen from the convex approximation in its derivation, and 
after comparing its estimate to the SCAD-TISP.) Hard-TISP and SCAD-TISP 
do not differ much here, which verifies the previous theoretical results regarding 
the hard-thresholding family in Section 4. 

Hard-TISP and SCAD-TISP achieve smaller test error than the lasso which 
may introduce extra bias when the signal-to-noise ratio is medium or high. In- 
terestingly, when the noise level is very high, the lasso (Soft-TISP) yields a more 
accurate estimate than the two. This is in fact not so surprising. In predictive 
learning, to reduce the test error, when the noise is relatively large compared to 
the signal, it is also necessary to shrink the nonzero coefficients even if the true 
ones are far from zero. In either hard- or SCAD-thresholding, there is basically 
no shrinkage offered for large nonzero coefhcients, while the lasso does this by 
soft-thresholding (although the shrinkage amount is the same as the threshold- 
ing value). Fortunately, TISP still gives us good selection results and achieves 
parsimonious models. We can apply, for example, a second-time shrinkage to 
the coefficients of the selected variables. Of course, a better strategy is to take 
into account these two concerns - selection and shrinkage - simultaneously and 
adaptively in building a model as probed in the next subsection. 

5.2. Hybrid-TISP for model selection and shrinkage 

To deal with the low SNR problem, a promising approach is to modify the 
thresholding in Hard-TISP to include adaptive shrinkage for nonzero coefficients. 
Motivated by the thresholding function of ridge regression given by (3.6), we 
propose a novel /i?/6rirf-thresholding: 



The penalty constructed via the mechanism introduced in Section 3.1 is made 
up of two quadratic parts: 




(5.1) 




(5.2) 
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Wc have seen the first quadratic part in the continuous hard-penalty (3.3) (which 
leads to the same solution as the discrete lo-penalty); the second part resembles a 
ridge penalty. See Figure 1 below. Note that the knots ±A/ (1 + rj) are dependent 
on rj, too. Simple calculations show that this P satisfies the BCC (cf. (3.7)) with 
H = I, and Theorem 3.1 holds. We can apply (3.10) given an arbitrary design 
matrix. The corresponding TISP (referred to as Hybrid-TISP) converges. The 
0-equation (3.11) implies the nonzero components of a Hybrid-TISP estimate 
result from a partial ridge regression. This fact can be used in implementation 
when the maximum number of iterations allowed has been reached. 





Fig 1. The penalty defined by hybrid-thresholding. As A and 77 vary, it takes the con- 
tinuous hard-penalty and the ridge penalty as extremes. 



Moreover, we have the following nonasymptotic result in parallel to Theorem 
4.2. Recah that fco = ||X||2, S^"' = S/n = X'^X/n, pt = /in,in(S(,"j_„ J and 

K = max| I S^'jJ^ 1 1 2 / Vc^nz • Define t = min|(S„, + 77/)^"'"S„2/3„^|, the minimum 

absolute value in the noiseless partial ridge estimate. Let Pe be the probability 
of Hybrid-TISP estimates having incorrect sparsity patterns, that is, for any 
13 & F, there exists some i or j such that ^ 7^ or /S^z.j = 0- 
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Theorem 5.1. Assume u > 0, and A, n are chosen such that k < ^ — ^ — 75= ■^^^^rr 

1 1 71 - 1 1 2 V ^nz TL7j 

and L > TT^, — . Then 

Pe < 2d,^{M")/M" + 2d„ML")/L", (5.3) 

where M" = ^(\~ -J^^nWlS^AU^/d^^i , L" = ( l - j^) . 

Hybrid-TISP successfully offers both selection and shrinkage in estimating /3. 
Before going into the numerical results, we summarize the traits of the design 
of Hybrid-TISP as follows, (a) Its penalty provides us a trade-off between the 
lo-penalty and the Z2-penalty (ridge-penalty), and takes the two as extremes, 
from which we secure selection and shrinkage simultaneously. In particular, the 
selection is achieved by a penalty more like Iq than Zi, seen from the penalty 
function, or the iterative thresholding, (b) Hybrid-TISP avoids double shrink- 
age. Double shrinkage is a serious problem in the design of naive elastic net [39] 
which simply adopts a linear combination of the Zi-penalty and the Z2-penalty. 
However, the ^i-penalty also plays a role in shrinking the nonzero coefficients 
in addition to the ?2-penalty. By contrast, Hybrid-TISP deals with the zeros 
and the nonzeros separately, by hard-thresholding and ridge-thresholding, re- 
spectively; there is no overlapping between them, (c) We have two parameters, 
A and ry, responsible for selection and shrinkage respectively. One drawback 
of the lasso is that it uses the same parameter to control both selection and 
shrinkage [24]. Therefore, it may result in insufficient zeros even if the SNR is 
pretty high, as shown clearly in Table 1. Hybrid-TISP has A, 77 designed for the 
two different purposes and can adapt to different sparsity and noise level, (d) 
The TISP selecting and shrinking interplay with each other during the itera- 
tion till in the end we successfully achieve selection/shrinkage balance in the 
final estimate. This is in contrast to the relaxed lasso [24] which treats selection 
and shrinkage as separate steps in building a model, (e) Finally, Hybrid-TISP 
is a very simple procedure to implement. It only involves multiplication and 
thresholding operations. 

In the implementation of Hybrid-TISP, an empirical parameter search is usu- 
ally needed to determine the values of A and ry, because running a grid search 
over the (A, 77)-space is a formidable task. We search along a couple of few one- 
dimensional solution paths including the A-paths (with rj fixed) and the 77-paths 
(with A fixed) to save computational cost. The optimal tuning parameter from 
the ridge regression path (corresponding to A = 0), denoted by 77^''^ is used as 
a reference for r]. Briefly, our search process generates and searches along some 
A- and 77-paths, compares the results from these searches, and then takes (A, rf) 
to be the one minimizing the validation error. The concrete search paths are as 
follows, (i) n > p. Denote the OLS scale estimate by a. If n/p < 5, or n/p < 10 
but (T > 5, we adopt the alternative search strategy which has been shown to 
be fast and efficacious [27]: fixing 77 at 0.5r]'^^\ search along the A-path to get an 
optimal solution (having the smallest validation error) at, say. A*-"-*; then search 
along the 77-path with A fixed at A*-"^. If n/p > 10 and ct < 5, we only search over 
the A-path with tj = O.OSyy^''). In aU remaining cases, we generate and search 
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long two A-paths with t] = O.brj'^^^ and 0.05?7^''^ respectively, (ii) p > n. We use 
the above alternative search starting with 0.5r]^'^\ and an additional search for 
A with 77 fixed at 0.05ri^'^\ Accordingly, 3 paths in total are generated in the 
large-p situation. This simple empirical search does not cover the full parameter 
space but is more efficient than a grid search. The results are reported in Table 
1. We also included the elastic net (eNet) in the experiments, which has two 
regularization parameters as well. Note that eNet generates and searches along 
6 solution paths to tune the parameters [39]. 

Seen from Table 1 , Hybrid-TISP has amazing performance in both accuracy 
and sparsity. We briefly summarize the story as follows. When the noise level is 
low or medium, the value of A in the lasso is limited by the amount of shrinkage 
and thus gives insufficient sparsity. Large noise alleviates the problem but there 
is still much room for the improvement of test-error and sparsity-error because 
the amount of shrinkage may not equal to the thresholding value in the selection. 
The weighted lasso like the one-step SCAD has somewhat limited power because 
the OLS estimate may be inaccurate and misleading for weight construction. 
Benefiting from the /2-pGnalty, the eNet shows much better accuracy in the case 
of large noise and/or high correlation between the variables; nevertheless, the 
sparsity of the estimate may be seriously hurt when the ridge penalty must take 
control. And it seems possible to improve its test-error further by incorporating 
this sparsity in estimation. All of these problems can be resolved by Hybrid- 
TISP, which achieves the right balance between shrinkage and selection. Its test 
error is consistently lower than the eNet, and more importantly, Hybrid-TISP 
provides a parsimonious model as Hard-TISP. 

5.3. Large sample and large dimension experiments 

At the end of this section, we demonstrate the performance of TISP on large-n 
data as well as on large-p data. We modified the parameters in Example 1 and 
reran the simulations, where = pl'^-'l with p = .5, (3 is appended with zeros 
given by [3, 1.5, 0, 0, 2, 0, 0, • • • , 0]^, cr = 2, 5, and n, d are not fixed anymore: 
in the large sample experiment, d = 8, n = 40,80,200 (corresponding to 5 
times, 10 times, and 25 times as large as d); in the large dimension experiment, 
n = 20, d = 100, 200, 500 (corresponding to 5 times, 10 times, and 25 times as 
large as n). Table 2 shows the simulation results of these different combinations 
of n and d. In both situations, the Hybrid-TISP path is preferable in terms of 
accuracy and sparsity. Our conclusions are similar to the findings summarized 
before. Note that one-step SCAD uses the OLS estimate as the initial guess 
and thus is not included in the large-p simulation. In fact, as an example of the 
adaptive lasso, it is most powerful in large samples with small noise and low 
correlation between covariates, where the OLS estimate is accurate. The elastic 
net is an improvement of the lasso and provides a good algorithm in predictive 
learning. However, despite having two regularization parameters, it does not 
improve much the sparsity of the lasso. Hard- and SCAD-IPOD give significantly 
different solution paths than the above convex penalties. They may dramatically 
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reduce the sparsity error and the test error, say, for large-p sparse signals with 
moderate noise. Both thresholdings fall into the hard-thresholding family which 
does not introduce much estimation bias for large coefficients. Interestingly, if 
our main concern is to reduce the test error in building a statistical model 
(which is the most frequently used tuning criterion in implementation), they 
are not always our best choices. Indeed, it is more desirable to offer adaptive 
shrinkage to nonzero coefficient estimation to benefit from the bias-variance 
tradeoff. Hybrid- TISP is successful especially for the large-p data because it 
does joint and adaptive selection and shrinkage. 

6. Real Data 

Hybrid-TISP was applied to a real prostate dataset which was used by Tibshi- 
rani [30]. The prostate data have 97 observations and 9 clinical measures. In this 
example, unlike [30], we take the log(cancer volume) (Icavol) as the response 
variable and consider a full quadratic model; the 43 predictors are 8 main effects, 
7 squares, and 28 interactions of eight original variables — Iweight, age, Ibph, 
svi, Icp, gleason, pgg45, and Ipsa, where svi is binary. The lasso does not 
give stable and accurate results for this example due to the existence of many 
highly correlated predictors. 

The regularization parameters of Hybrid-TISP were tuned by leave-one-out 
cross-validation. To identify the relevant variables in a trustworthy way, non- 
parametric bootstrap resampling was used with B = 100. For every bootstrap 
dataset, after standardizing the predictors, we apply Hybrid-TISP with fixed 
regularization parameters tuned for the original dataset. Figure 3 shows the 
percentages of the bootstrap coefficient estimates being nonzero over the 100 
rephcations for all the 43 predictors. The histograms are plotted in Figure 2. 
It is easy to see that 8 variables are much more significant than the others. 
In fact, these are exactly the variables selected by Hybrid-TISP on the orig- 
inal data. A more careful examination shows that they appear (jointly) 36 
times in the selected models, the top visited octuple in bootstrapping. These 
variables fall into two groups with similar patterns: (I) {xs, a;i9, a;25, a;38}, i.e., 
{icp, lweight*lcp, age*lcp, gleason*lcp}; (II) {a;8, 0:22, a;28, X42}, i.e., {Ipsa, 
lweght*lpsa, age*lpsa, gleason*lpsa}. The within-group correlations are 
very high, > .98 for Group (I), and > .93 for Group (II). Furthermore, an 
interesting feature is that for any of the eight variables selected by Hybrid- 
TISP, the other three in the same group are most correlated with it among 42 
predictors. 

7. Discussion 

We have proposed the thresholding-based iterative selection procedures for solv- 
ing nonconvex penalized regressions. In fact, people have long before noticed the 
weakness of the convex /i-constraint (or the soft-thresholding) in wavelets and 
have designed many different forms of nonconvex penalties to increase model 
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Test- err 
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0.0 


9.2(1.9) 
4.5 


10.7(2 7) 

29.7 
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Prop-NZ 


53.0% 
100% 
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100% 
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100% 
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n = 40, d = 8, 

a = 5 


Test- err 
Spar-err 


12.6(3.0) 
29.8 
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25.0 


13.9(2 4) 

16.9 
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n = 80, d = 8, 

£7 = 2 
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53.3% 


99.3% 
51.6% 


95.8% 
66.7% 


99.0% 
66.7% 


n = 20, d = 500, 
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99.0% 
33.3% 


99.3% 
51.3% 



Table 2 

Performance comparisons on the simulation data with large sample size or 
dimensionality, in terms of test error, sparsity error, proper sparsity, and proper 
nonsparsity over 50 simulations. Six methods are listed here: lasso (Soft-TISP), 
one-step SCAD, Hard-TISP, SCAD-TISP, elastic net (eNet), and Hybnd-TISP; the 
last two both have two regularization parameters. 
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Fig 2. Histograms of the 100 Hybrid-TISP coefficient estimates for all of the 43 pre- 
dictors in the prostate example. 



sparsity and accuracy. But for a nonorthogonal regression matrix, there is great 
difficulty in both investigating the performance in theory and solving the prob- 
lem in computation. TISP provides a simple and efficient way to tackle this. 

Somewhat different than other studies, we started from thresholding rules 
rather than penalty functions. Indeed, there is a universal connection between 
them. But a drawback of the latter is its non-unique form: different penalties 
may result in the same estimator and the same thresholding. The main con- 
tribution of this paper is the study of a class of 0-estimators satisfying (3.11), 
which can be naturally computed by TISP, and are associated with penalized 
regressions. With a carefully designed thresholding rule, we obtained a good 
estimator for model selection and shrinkage. Starting from & greatly facilitated 
the computation and the analysis. In fact, some penalty designs may even have 
a better explanation from 0, or equivalently, the ^/'-function — for example, the 
SCAD-pcnalty (recall that it is defined by its derivative) seems to originate from 
Hampcl's three-part redescending tp. Conversely, we can use TISP to compute 
M-estimators in robust statistics as described in Section 3.3. 

Using a thresholding rule in the hard-thresholding family, TISP gives good 
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Fig 3. Proportions of the Hybrid-TISP coefficients being nonzero over the 100 bootstrap 
replications in the prostate example. Eight variables have nonzero coefficient estimates 
more frequently than zero estimates. 



selection results. Our novel Hybrid-TISP, accomplishing a fusion between Iq- 
penalty and Z2-penalty based on the hard-thresholding and the ridge thresh- 
olding, shows superior performance and beats the commonly used methods in 
both test-error and sparsity. The hybrid penalty function (5.2) may look a bit 
odd, but is quite natural and simple from the point of view of thresholding; 
see (5.1). It is worth mentioning that in contrast to [2, 3, 19], where more than 
one tuning parameter is considered a drawback and unneccssity, we believe a 
good procedure should have two explicit regularization parameters to control 
and balance selection and shrinkage. 

We assume the penalty function P is dependent on f3 and A only. Therefore 
the iterative weighting, substituting the nonnegative garrote [19] for Q in TISP, 
is not covered by the studies in this paper. In fact, with (3 involved in P, it 
might be difficult to optimize in the second step of the mechanism introduced 
in Section 2. 

The solution path associated with a nonconvex penalty is generally not con- 
tinuous in A. For example, even for the transformed Zi-penalty in Example 3 
which is diffcrcntiable to any order on (0,-|-oo), the solution path still has no 
A-continuity practically. Hence a pathwise algorithm is not appropriate here. 
Empirically, using a zero estimate as the start in nonconvex TISPs works pretty 
well. We conjecture that it leads to an estimate with some least norm prop- 
erty. Take Hard-IPOD as an example: this roughly means that we were looking 



Y. She/TISP for Model Selection and Shrinkage 



23 



for the local minimum of the Zo-pcnalized regression that is closest to zero in 
building a parsimonious model. 

The generalization of TISP to GLM seems straightforward; we will inves- 
tigate this topic in the next paper. TISP fits perfectly into the Accelerated 
Annealing [27] and thus can be used in the generic sparse regression with cus- 
tomizable sparsity patterns, such as the supervised clustering problem. Other 
future studies include developing some acceleration techniques for TISP (like 
the relaxation and asynchronous updating [27]) and deriving some risk oracles 
in theory. 
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Appendix A: Proofs 

A.l. Proofs of Theorem 3.1, Proposition 3.2, and Proposition 3.3 

Let's consider the orthogonal case first. Define Q(7) = j|7 — ol\\\/2 + F(7; A), 
where a is a known vector. Let 7^ = argmin(5(7). By the construction of P 
and Proposition 3.1, 7^ satisfies 7^ — ct + s(7o; A) ~ 0. 



Q{io + h)-Qho) 



> 



l\ho + h-a\\l-^\\j^-cx\\l + P(7„ + h; A) - P(7„; A) 
l\Ml+ < h,j,-a> +P{j, + h;X)-P{j^;X) 
\\Ml + [Pho + h:X)- P(7o; A)- <h,s>) 

\\\h\\i-\yHh^\h'{i-H)h. 



This inequality is due to the BCC (3.7). On the other hand, we know 



In summary, we get 



Qho + h)- Q(7o) > 2^ 



(A.l) 



for both A = I H and A = 0; formally, we write A= [I - H)yG. Note that 
(A.l) is a global result for any h. 
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Now look at the TISP. Recall the g in (2.2) is 

5(/3,7) = i||X7 - y||2 + P(7; A) + ^(7 - f3fil - S)(7 - /3). 
Then given /3, we can write g as 

5(^7) = l\h-{{I- S)/3 + + C(X,y,/3), 

and apply (A.l) with a = (J - S)/3 + X'^y, 

.9(A7o(/3) + M-.9(/3,7o(/3)) > ^/i^((^-^^)VO)/i, \fh. (A.2) 
Correspondingly, for the TISP iterates /S'-'-' , we have 

f(M'^'^) + lif^^'^'^ f3^'YiI S)(/3(^+'^ - f3^'^) = g{f3^'\f3^'^'^) 
< g{f3'-'\f3^'^) - i(;3^^+'' - /3*^')^((/ - if) V 0)(/3'-''+i) - /S'^'^) 
= /(/3(^")) - i(/3(^+i) - f3^'Y{{I -H)V 0)(/3(-'+i) - /3(-'")). 
That is, 

/(/3(^)) - /(;3(^+'^) > ^(/3^^'+'^ - - if) V + 7 - S)(/3(^'+i) - ^(^■)). 

(A.3) 

Now (3.8) and (3.9) can be obtained after simple calculations. 

As for Proposition 3.2, let /S^-"-') f3* as k 00. Under the condition 
/Zinax(S) < 1 V (2 — /iniax(if)), Thcorcm 3.1 states that 

ll^b.+i) _ ^0.)||2 < (^(^0.)) _ /(/3(>+i)))/c < (/(/3(>)) - /(^(^'=+i)))/C ^ 

That is, e((J - £)/3'^''' + X^y; A) - l3^^'''> 0. Therefore, /3* is a fixed point 
of TISP. 

Finally, we prove Proposition 3.3. Noticing that 7o(/3*) = f3* , we get the 
following inequality from (A.2) 

g{f3*,f3* +h)-g{f3*,f3*) > ^h^ {{I - H) W 0)h, yh. 

Since g(r,r) = /(r), 

fiP* +h) + ^h^{I ~ > f{f)*) + i/i^((J - if) V 0)h, Vh 
^ /(/3* +/i)-/(/3*) > i/i^((f -ff) VO + £-f)/i, V/i. 
Therefore, if /i(S) > /Xmax(ff ), /3* is a global minimizer of /. 
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A. 2. Proofs of Theorem 4-1 and Theorem 4-2 



These theorems have all been essentially proved in [27]. We provide a self- 
contained proof as follows. All inequalities and the absolute value '||' are un- 
derstood in the componentwise sense. Assume, for the moment, X has been 
column- normalized such that the diagonal entries of S = X are all 1. Let 
S/ = XjX/, S/j. = X'^Xp for any index sets /,/'. 

Lemma A.l. Assume S„z is nonsingular. The TISP estimate (3 satisfies the 
following equations 



^z.nz'SjXl,)e + TY:,,n^-Ejsgn0^,-X/kl) - Tsgn0,-X/kl) 



S-,^(X,l,e-r.Tn(/3„,;A/fco^)) 



nz z.nzh' z 



(A.4) 
(A.5) 



z,nz^rLZ ^nz^z- 



/3, 

where S'^ ~ S 

This can be obtained directly from (4.2). 

Lemma A. 2. Let z ~ N{0, Ddxd), z' ~ N{0, Adxd), where D is a diagonal 
matrix. Assume the diagonal entries of A, denoted by diag{A), are the same as 
those of D, i.e., diag{A) ~ diag{D). Then 



P(max|2;'| > c) < P(max|z| > c), 

for any c. 

This is clear from Sidak's classical result [31] in 1967. 



(A.6) 



To prove Theorem 4.1, let X'/ = X] 



A ^ 
V ^ 



Yl^^nz'^nlX'^z and define 



X;e + rS,,„,S-isgn0„j| <t} 

^ l^^ni^l <\(^nz\ for any s satisfying |s| < l| 



"^r,} X"^ 



nz nz 



Clearly 1 - < P(A^ U V) < P(A=) + P(t/'=). 

Let S^,„2 = [vi,- ■ ■VdJ'^ , then ||t!i||2 < Ksfd^:,. So [wf S~]sgn(^„^)| < 



l^.l|2-||s,: 



2 • 



sgn(/3„^)||2 < ndnz/tJ. and S^,„^S„^^sgn(/3„^) < nd^zly- It 



follows that P(yl=) < P (^|max X'^ e > (1 - Kdn^/Aijrj) 



Define = X'/ e e 3?''-. Note that X'/ X'^ = Sy.. Thus e'^ - 7V(0, a'^S:,). 

i-^T - - - - 



Since diag(S2) = 1 and diag(S^,„2S„^^S^_„^ 
Lemma A. 2 states that 



P(A'=) < P max|e'/|-o->7-(l 



) = {vjll-lv,\ > 0, diag(5,) < 1. 



< dyP \e''\>T{l 



= 2d,$([A/, +oo)). 
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where e" ~ N{0, Id,xd,)- Using the standard bound of the normal tail proba- 
bility, we get PiA") < 2d^Lp{M)/M, where M = {l - ^) J. 

To bound PiV^), suppose the spectral decomposition of Tinz is given by 
= UDU^ with U = [iti,--- ,Ud^J\^ , then we can represent as 
[ujD-^Uj\^^^^^^^^, and as Yl%\sjujD-^u 



. It follows that 



< u 



diag(S;^i) < i and |S-is| < Therefore, P(F^) < P 

where Lq = niin |/3„^| — rdnz/tJ-- 

Because Y^~^X'^^e ~ iV(0, (t^S"^^), and we have shown diag(S~2^) < i, 
applying Lemma A. 2 yields 

P(F^) = P(max S;iX^,e > Lq) < P(max|e^'| > Lq^) < 2d„,$([L, (x))), 

cr 

where e'^ ^ N{0, Id„,xd„J- Hence, 

l-Pe< P{A' U y^) < P{A-) + P{V') < 2dz^ip{M) + 2dnz^^{L), (A.7) 

where L:=^(min|/3„J-l^). 

In fact, we can get something slightly stronger than (A.7). Observing that 
X ^ e is independent of X^^e, we have 

Ps > PiA nV)= P{A) ■ P(y) > [1 - 2^{~-M)f' [1 - 2^{~L)f"' . (A.8) 

(A.8) implies (A.7). 

We assumed xjxi = li = l,---,din the above derivation. If the Z2-norm 
of each column of X is no greater than CTmax, we only need to replace /3, /3, t, 
by f3 ■ iTmax, /3 • CTmax, T / Gmax, respectively. The proof of Theorem 4.1 is now 
complete if (Tmax = \/n. 

For Theorem 4.2, noticing that (a) sgrL(u) = 0,V|u| > cr by definition and 
(b) Ps > P (M'^ = 0, and \ > cr(^)) with === l3^, t'^'^ = r/V^, we 
can prove it following the same lines. 

A. 3. Proof of Theorem 4.3 

Use the same symbols and notations as defined in Appendix A. 2. Let = |j/92||2, 
rnz = WPnz - f^nzWl- Fi'om (A.5), we have 

Rr^z < 3 i^EWH-^XlMll + r^i?||S;iigH(^„J|i^ + • P.) , 

due to Cauchy-Schwarz inequality and the fact that 1 1 S , „ ^ j 1 2 = max 1 1 S ^ „ , a | ] 2 < 

"' ~ l|a||2 = l 

KVdAi^. Thus Rnz < 3 (aHr(i:-^) +t^^ + k^^^^ ■ P^) and (4.8) holds. 
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To get (4.9), we need the following result about ^max{Sz the largest eigen- 



value of ^: flmax{S, ^) < M 1 - K 



^2 , d^d„^ 



This is true by noting that 



Sz = Xz is semi-positive definite and (Sz) >v — K^dzd 
By (A. 4) and the results in Appendix A. 2, we have 

Tz < SfilaASz') \\\xfe\\l + r'WEz,n.-S-z^0nM + r'\\^^C^ 



z)\\2 



Therefore, 

Rz < 3flf^^^{Sz^) 



iX'z elU ■ Ia- + { T dz ■ K 



E{\\xfe\\l ■ l^c) + ( 1 + ] r^dz ■ 2dzi-MM) 



2 ^7lZ 



■ M' 



< 3til,,iS-') ■ dzE 



max |e'j^ I)' ; max |e'j^ | > r(l — k- 



Since for a random variable z with probability density p{-) and a > 0, 



Eiz"^; z>a)= f fp{t)dt = f 

J a J a 



rp{t)dt = / 2sP{z > s)ds + a^P{z > a), 



it follows from Lemma A. 2 that 



E 



(max|e']^|) ,max|e']^| > t(1 — k— ^) 



< E [max |e'/p • cr^; max |e'/| > M] . 



(Recall that e" ~ N{0, Id^xd^)-) The density of max je'/j is given by 2dz(p{t){l' 
2$(-t))'^--i. It is easy to get 

E [max |e'/p; max |e'/| > M] < 2dz / t^Lp{t)dt < 2dz{M + l/M)Lp{M). 



Hence, 



Rz < 3— ( 1 - K 



2 ^z djiz 
-2j2 I, ,2 



2a'di{M+—\^{M) 



M 



1 



dl{K,M + K2 — MM) 



~ ' M' 

Using a similar scaling argument we obtain Theorem 4.3. 
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A. 4- Proof of Theorem 4-4 

Let f3 , /3 denote the hard- and soft-thresholding estimates with threshold 

~ H ~ s ~ 

value rcr. It is easy to see /3 , /3 , and /3 all have the same sign and (3 is sand- 
wiched by the other two. Therefore, E0 - /3||i < E £;(max((^f - /3,)2, (/3f - 
PiY)). It is sufficient to study soft- and hard-thresholdings in the univariate 
case. 

Let y = 1^1 + e (all are scalars) with e ^ ^(0, 1), and ps(t, /i), ph{t, /i) be the 
risks of the soft- and hard-thresholdings with parameter t. It is well known [7, 12] 
that 

Ps{t, ^l) < min(ps(T, 0) + 1 + r^) < min(?^ + 1 + t^) (A.9) 

T 

for any r > 0. Yet it seems that there is no such explicit nonasymptotic bound, 
or a complete proof for the hard-thresholding rule. This short appendix is mainly 
to give some details about this. 

Our goal is to show the following on the basis of [12] 

Ph{t,ii) < l + T2forT>l (A.IO) 
Ph{t,p) < ph{t,0) + 1.2p^. (A.ll) 

Dohoho & Johnstone have shown (A.IO), and (A.ll) for < /.( < t, but it is 
technically difficult to use the second derivative to prove (A.ll) for any /i > 0. 
Let g = dpn/dp, — 2.4/i, and ph{t, p) is known to be [12, 19] 

1 + {p^ - 1)($(t -p)- $(-T - p)) + (r + pMt + p) + {t- p)^{t - p), 

where iy9, $ are the standard normal density and distribution functions, respec- 
tively. One may observe that sup^>Q ^(0, p) < suPt.>o .g(r, 0) = 0, which is trivial 
to verify. So it is sufficient to show that for any (r, /i) > 0, there exists some 
9 G [tt, Itt] such that the directional derivative Dgg at (r, /i) is greater than 0, 
or 39T,fj. & [0, §] s.t. Dgg{T,p) < 0, because g is smooth enough. 

Consider a uniform direction 9 = j, and let h = Dgg = + ^)/\/2. We 
assume p > t in the following. Then simple calculations yield 

h{T,p) = V2 ■ [{-PiT + p) - <i>{p - t)) - p(p{p - t) + 
(p{T + p){t^ + 3t^p + 2t/? +p-2t) - 1.2] 

< V2{0.5 + {T + pfip{T + p)-l.2) 

< (0.5 + 0.5 - 1.2) < 0. 

Therefore, 

Ph{t,p) < min(p/f(r,0) -|- 1.2p^,l + T^) < min(2(p(T)(r + -) + l.2p^,l + T^) 

(A.12) 
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for any r > 1. Now, combining (A. 9) and (A. 12) we can bound the univariate 
TISP risk 



2, . /2(^(t) , 1.2 2 



< max(ps(T,//),Pff(r,^)) < (1 + T )min h — — , 1 

\ T 1 + 

for any r > 1. Theorem 4.4 thus foUows. 

Finally it may be worth mentioning that although applying Stein's lemma 
is one possible way (see, for example, Gao [19]), it docs not handle the oracle 
bound well for an estimator very close to hard thresholding — like Zou's oracle 
bound for the adaptive lasso [38] , because the hard-thresholding function is not 
weakly differentiablc. (Due to an error made in the derivative calculation, Zou's 
oracle bound for the adaptive lasso defined by min i||y — X/3||2 + T^z/;i|/3i| with 

(X IPoisA'"' should be (2 log 71 + 5 + 4??) • mm{l3f, a^) + aV(20n^)) , 
with the first factor being (2 logn + 5 + Arj) instead of (2 logn + 5 + 4/ry), which 
diverges as rj goes to infinity. Sec [27] for detail.) 



A. 5. Proof of Theorem 5.1 



In the proof, all inequalities and the absolute value '||' are understood in the 
componentwise sense. 

First we calculate the generalized sign for the hybrid-thresholding (5.1) 

re[-i,i], ifM = 
^iu;X,r,)^ lo, if|u|e(0,^) . (A.13) 

And note that t(A) = A. The generalized sign form of the 0-cquation for Hybrid- 
TISP estimate ^ from (3.10) is 

S/3 = X^y - AigH (^;3; A , (A.14) 

where fco = \\X\\2. 

The proof still follows the lines of the proof for Theorem 4.1. Assume, for 
the moment, X has been column-normalized such that the diagonal entries of 
S = X'^X are aU 1. Clearly, = 0, |/3„J > is a sufficient condition for 

the zero consistency of /3. From Lemma A.l, the 0-equation is equivalent to 
S J, = {Xl - S,,„3S;,iX^Je + AS,,„,S-iSgii(^„,; ^, ^) - A§gH(^,; ^, ^) 

^nz = ^nz + ^nz nz^ ~ ^^%^{i^nz\ p" i p")) ^ ^nz ^z.nzf^^ 
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Our calculations based on the definition of sgn show that 
Define 

A ^ {|{xf-S,,„,S-;[/-,7(S„,+r//)-i]Xl.}e + ,7S,,„4S„,+,77)-i/3„,| < a} 



V 



> ^ 



Then < PiA" U 1/^=) < PiA") + PiV). 

To bound the first probability, noticing that \7-i'Sz.nz{'^nz + vl)^^ f^nz\ ^ 
i^Vd^Tzj^^Wf^nzh, wc have P(A^) < P (uia^\e[\ > X - nVd^zj^m^zh) , 
where e'^ = [x^ - S,,„,S-i[/ - 77(S„, + Ty/)-!]^^,} e. Since 

diag(uar(e'i)) < a'^ disigCE < It follows from Lemma A.2 that PiA") < 
P (max |eiV > A - KVd^_-^\\f3^J\2) , where e'/ ^ iV(0, 7d„.xd„ J- Define M" 



■ fJ.+V ' 

\ (a - K-f-V^||/3„J|2). WeobtainF(A'=) < 24$([Ar , +(»)) < 2dMM")/M" 



Next consider Let e^, = (S„^, + r/7)-iX^^e. Then 

P(n</'(max|4|>.-^). 
Since ?;ar(ey = (S„, + r;7)-iS„,(S„, + r/7)-V2, diag(var(e^)) < By 



Lemma A.2 again, we know P(y) < P (^maxlcjl > (^i - k^jj , where 

e'2' ^ A^(0,7d_xd„J. Define L" = ^ It follows that P(l/^) < 

2d„,$([L", +CX))) < 2dnz^{L")/L". 

We assumed ai^a;; = li = l,---,din the above derivation. If the Z2-norm 
of each column of X is no greater than dmax, it is not difficult to know that 
we only need to replace the /3, 0, A, ry, by /3 • tTmax, ^ ■ CTmax, X/crmnK, v/'^mi,^, 
respectively. The proof of Theorem 5.1 is now complete if (Tmax = \/n. 
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